2025-08-03
In today’s digital landscape, customer support conversations increasingly take place over chat and social media platforms. These short-form exchanges are often emotionally charged and can signal a customer’s satisfaction, frustration, or potential escalation. Understanding the emotional tone behind these messages can:
Yet, analyzing this kind of shorthand-heavy language presents a unique challenge for traditional sentiment analysis models.
This project explores how VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon-based sentiment analysis tool, can classify tone in real customer support messages (Hutto and Gilbert 2014). Its design prioritizes speed and interpretability, making it well suited to short, informal content like tweets and chat messages. VADER’s scoring mechanism is particularly sensitive to social media features such as emojis, capitalization, and punctuation, which are often critical to conveying tone in these environments (K. Barik and Misra 2024).
To build a full machine learning pipeline around VADER, we use its sentiment labels (positive, neutral, negative) as training targets and train an XGBoost classifier on TF-IDF features extracted from the message text. Labeling with VADER eliminates the need to hand-label messages or train a separate sentiment model from scratch, and XGBoost is well-suited for this task because it performs efficiently with sparse, high-dimensional data.
The dataset selected for this project is the “Customer Support on Twitter” dataset from Kaggle, which contains real-world support interactions between users and brands such as Apple, Amazon, and Comcast. The messages are short, informal, and emotionally expressive—closely mirroring real-world customer support scenarios—and make the dataset ideal for sentiment analysis and predictive modeling.
Natural Language Processing (NLP) has become a vital tool for understanding customer sentiment across digital platforms. A variety of approaches have been proposed in the literature, from lexicon-based models such as VADER to machine learning methods like XGBoost. This review highlights the studies that informed the methodological design of our project.
Lexicon-based methods remain a powerful choice for analyzing short, informal messages. VADER is particularly effective because it incorporates key linguistic signals such as:
• Capitalization ("AWESOME" → increases intensity)
• Punctuation (! → amplifies sentiment)
• Emoticons (:) → amplifies sentiment)
• Negation ("not good" → polarity reversal)
These elements help capture the nuanced sentiment found in customer service conversations that traditional lexicon models often miss.
Recent research continues to support and expand on VADER’s use. Barik and Misra (K. Barik and Misra 2024) evaluated an improved VADER lexicon in analyzing e-commerce reviews and emphasized its interpretability and processing speed. Chadha and Aryan (Chadha and Aryan 2023) also confirmed VADER’s reliability in sentiment classification tasks, noting its effectiveness in fast-paced business contexts. Youvan (Youvan 2024) offered a comprehensive review of VADER’s core logic, highlighting its treatment of intensifiers, negations, and informal expressions.
While VADER is powerful, it’s limited to its predefined lexicon and rule set. To complement VADER’s labeling, we incorporate XGBoost, an efficient and scalable gradient boosting algorithm, as a supervised classifier.
Lestari et al. (Lestari et al. 2025) compared XGBoost with AdaBoost for movie review classification and found XGBoost achieved higher accuracy and generalizability. Sefara and Rangata (Sefara and Rangata 2024) also found XGBoost to be the most effective model for classifying domain-specific tweets, outperforming Logistic Regression and SVM in both performance and efficiency. Lu and Schelle (Lu and Schelle 2025) demonstrated how XGBoost could be used to extract interpretable feature importance from tweet sentiment, providing a compelling case for our approach.
Before applying VADER, our process began by cleaning the raw tweet text to ensure consistency. We removed URLs, user mentions, and hashtags. While VADER can handle informal text, this step was performed to improve text uniformity and prepare for downstream modeling.
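The cleaning step described above could be sketched as follows. This is a minimal illustration using only the standard library; the regular expressions and the `clean_tweet` helper are assumptions for this example, not the exact rules used in the project.

```python
import re

# Hypothetical patterns; the project's exact cleaning rules are not shown in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs, user mentions, and hashtags, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Invented example tweet, for illustration only.
cleaned = clean_tweet("@AppleSupport my phone keeps FREEZING!!! #fail https://t.co/abc123")
print(cleaned)
```

Note that this sketch leaves capitalization and punctuation untouched, since those are signals VADER relies on.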
After cleaning, we applied VADER to generate a compound sentiment score for each tweet and label tweets as Positive, Neutral, or Negative based on standardized thresholds. The compound sentiment score is computed as:
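\[ \text{compound score} = \frac{\sum_{i=1}^{n} s_i}{\sqrt{\left(\sum_{i=1}^{n} s_i\right)^2 + \alpha}} \]
Where \(s_i\) is the valence score for each word or token (after VADER’s rule-based adjustments) and \(\alpha\) is a normalization constant (set to 15 by default). This normalization maps the summed score into the range from -1 to 1.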
The final sentiment labels are then assigned using the following thresholds:
Positive if compound ≥ 0.05
Neutral if -0.05 < compound < 0.05
Negative if compound ≤ -0.05
This automated labeling process served as the backbone for our supervised classification model.
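As a rough sketch of the labeling logic, the snippet below normalizes a summed valence score and applies the thresholds listed above. It assumes the normalization used in the reference VADER implementation, \(x / \sqrt{x^2 + \alpha}\) with \(x\) the summed valence. The function names and token scores are hypothetical; in practice the compound score would typically come from the `vaderSentiment` package (`SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`) rather than being computed by hand.

```python
import math


def normalize(raw_sum: float, alpha: float = 15.0) -> float:
    """Map a summed valence score into (-1, 1) using x / sqrt(x^2 + alpha)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def label_from_compound(compound: float) -> str:
    """Apply the Positive / Neutral / Negative thresholds listed above."""
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"


# Hypothetical per-token valence scores, for illustration only.
token_scores = [-1.5, -2.1, -0.8]
compound = normalize(sum(token_scores))
print(round(compound, 4), label_from_compound(compound))
```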
For example, consider the following tweet and how VADER responds to each of its features:
“I’ve been delayed over an HOUR and STILL no response… this is ridiculous!!!”
| Feature | Detected Element | VADER Response | Score Impact |
|---|---|---|---|
| Capitalization | “HOUR”, “STILL” | Increases intensity | -0.10 |
| Punctuation | “…” and “!!!” | Amplifies negative sentiment | -0.25 |
| Lexicon Match | “ridiculous” | Strong negative valence | -0.25 |
| Overall Tone | Complaint/frustration | Strongly negative | -0.15 |
| Final Compound | | | -0.75 |
This tweet produces a compound score of -0.75, well below the -0.05 threshold, and is therefore labeled Negative.
By relying on VADER instead of manual annotation, we create a foundation for downstream supervised learning. This aligns with findings by Lu and Schelle, who demonstrated that VADER-labeled tweets combined with TF-IDF and XGBoost achieved performance comparable to manually labeled datasets (Lu and Schelle 2025). Next, we turn to feature extraction to transform our labeled text into a numerical form suitable for machine learning.
To convert tweets into numerical features for modeling, we employ Term Frequency–Inverse Document Frequency (TF-IDF), a technique that quantifies how important each word is within the context of both the individual tweet and the overall corpus.
Term Frequency (TF) measures how often a word appears in a single tweet (i.e., a document) relative to the total number of words in that tweet:
\[ \text{TF}_{w_n} = \frac{g_{w_n}^{d_m}}{T_{d_m}} \]
Where:
• \(w_n\) is the \(n^{\text{th}}\) word in document \(d_m\) (a tweet)
• \(g_{w_n}^{d_m}\) is the number of times word \(w_n\) occurs in document \(d_m\)
• \(T_{d_m}\) is the total number of words in document \(d_m\)
Example:
If the word delay appears twice in a 50-word tweet, its term frequency is:
\[ \text{TF}_{w_n} = \frac{2}{50} = 0.04 \]
Inverse Document Frequency (IDF) evaluates how unique or informative a word is across the full set of tweets. Common words receive lower IDF scores, while rare or distinctive words receive higher scores:
\[ \text{IDF}_{w_n} = \log\left(\frac{D}{N_{w_n}}\right) \]
Where:
• \(D\) is the total number of documents (tweets) in the corpus
• \(N_{w_n}\) is the number of documents that contain word \(w_n\)
Example:
If delay appears in 5 out of 500,000 tweets, its IDF will be much higher than that of hello, which may appear in 10,000 tweets.
Finally, TF-IDF combines these two metrics to weight each word by how frequently it appears in a tweet and how rare it is across the full dataset:
\[ \text{TF-IDF}_{w_n} = \text{TF}_{w_n} \times \text{IDF}_{w_n} \]
This process highlights terms that are both prominent in a tweet and distinctive across the dataset, making TF-IDF a powerful and interpretable technique for feature extraction in sentiment analysis pipelines (K. Barik and Misra 2024). With our feature matrix ready, we proceeded to modeling.
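The TF, IDF, and TF-IDF definitions can be sketched in a few lines of plain Python. The toy corpus and function names below are hypothetical, and the IDF here uses the total number of tweets in the corpus as its numerator. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values will differ from this unsmoothed version.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized tweets (hypothetical data).
corpus = [
    ["my", "flight", "delay", "is", "a", "long", "delay"],
    ["hello", "thanks", "for", "the", "help"],
    ["hello", "my", "order", "is", "late"],
]


def term_frequency(word, doc):
    """Occurrences of `word` in `doc` divided by the number of words in `doc`."""
    return Counter(doc)[word] / len(doc)


def inverse_document_frequency(word, docs):
    """log(total documents / documents containing `word`).

    Assumes `word` occurs in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)


def tf_idf(word, doc, docs):
    """Weight a word by its frequency in `doc` and its rarity across `docs`."""
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)


delay_score = tf_idf("delay", corpus[0], corpus)  # frequent in its tweet, rare overall
hello_score = tf_idf("hello", corpus[1], corpus)  # appears in two of three tweets
print(delay_score > hello_score)
```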
To model sentiment classifications based on TF-IDF features, we employ XGBoost (eXtreme Gradient Boosting), a scalable and regularized tree ensemble algorithm designed for both accuracy and efficiency. XGBoost builds an additive model by iteratively constructing decision trees that minimize a regularized objective function, which balances prediction accuracy with model simplicity. The objective consists of two components: a convex loss function that measures how well the model fits the data, and a regularization term that penalizes overly complex trees.
For each sentiment class (positive, neutral, negative), the raw prediction score \(\hat{y}_i\) is computed as the sum of outputs from \(K\) trees:
\[
\hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F}
\]
Where:
• \(x_i\): The input TF-IDF vector for tweet \(i\)
• \(f_k(x_i)\): The prediction from the \(k^\text{th}\) tree for input \(x_i\)
• \(\sum_{k=1}^K f_k(x_i)\): The sum of predictions for each class
• \(\phi(x_i)\): The combined prediction from all trees
This formula is foundational to XGBoost. It expresses how the final prediction is built up iteratively from multiple decision trees, which is the basis of boosting. When classifying sentiment labels, the accumulated scores are passed through a softmax function to determine class probabilities.
Example:
Suppose we are using XGBoost to classify the sentiment of a tweet as positive, neutral, or negative, and the model has been trained with \(K = 3\) boosting rounds (trees) per class.
For a new input tweet \(x_i\), each of the 3 trees for each class outputs a score, and the scores are then summed per class:
• Positive class score: 1.2 + 0.9 + 1.1 = 3.2
• Neutral class score: 0.5 + 0.6 + 0.3 = 1.4
• Negative class score: 0.8 + 0.7 + 0.6 = 2.1
Since the positive class has the highest total score, the model assigns the label positive.
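The worked example above can be reproduced with a short sketch that sums the per-tree scores and applies a softmax. The dictionary layout and the `softmax` helper are illustrative choices for this example, not XGBoost's internal representation.

```python
import math

# Per-tree scores from the worked example above (K = 3 trees per class).
tree_scores = {
    "positive": [1.2, 0.9, 1.1],
    "neutral": [0.5, 0.6, 0.3],
    "negative": [0.8, 0.7, 0.6],
}

# Sum the tree outputs for each class.
class_scores = {label: sum(scores) for label, scores in tree_scores.items()}


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


probabilities = softmax(class_scores)
predicted = max(probabilities, key=probabilities.get)
print(predicted, {label: round(p, 3) for label, p in probabilities.items()})
```

Because softmax preserves the ordering of the raw scores, the class with the highest summed score is also the class with the highest probability.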
Once prediction scores are computed, XGBoost must also determine how to train itself to make better predictions through the process of learning optimal tree structure. This is done by minimizing the regularized objective function, which balances prediction accuracy and model complexity:
\[
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)
\] \[
\text{where }\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2
\]
• \(l(\hat{y}_i, y_i)\) is a differentiable convex loss function (softmax loss for multiclass classification) that measures how far the model’s prediction \(\hat{y}_i\) is from the true label \(y_i\)
• \(f_k\) is the \(k^\text{th}\) decision tree in the ensemble
• \(T\) is the number of leaves in a tree
• \(w\) is the vector of leaf scores (weights)
• \(\gamma\) and \(\lambda\) are regularization parameters that control tree complexity
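To make the complexity penalty concrete, the sketch below evaluates \(\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2\) for two hypothetical trees. The leaf weights and the values of \(\gamma\) and \(\lambda\) are invented for illustration and are not taken from the trained model.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for a single tree."""
    num_leaves = len(leaf_weights)  # T
    squared_norm = sum(w * w for w in leaf_weights)  # ||w||^2
    return gamma * num_leaves + 0.5 * lam * squared_norm


# Hypothetical trees: same regularization settings, different complexity.
small_tree = [0.4, -0.3]
large_tree = [0.4, -0.3, 0.9, -1.1, 0.2]

small_penalty = tree_penalty(small_tree, gamma=1.0, lam=1.0)
large_penalty = tree_penalty(large_tree, gamma=1.0, lam=1.0)
print(small_penalty, large_penalty)
```

The larger tree is penalized more, both for its extra leaves and for its larger leaf weights, so it must reduce the loss by a correspondingly larger amount to be preferred.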
Therefore, by combining a strong predictive loss with a tree-specific complexity penalty, XGBoost is able to generalize well to new data, outperforming simpler models while remaining computationally efficient (Chen and Guestrin 2016). It also provides feature importance scores, offering insights into which terms most influence predictions—a valuable asset for customer service teams seeking actionable feedback.
With the model trained, we evaluated its performance using several classification metrics.
To understand how well our model performed, we used four core metrics: accuracy, precision, recall, and F1 score.
These metrics were selected to account for class imbalance, which is common in sentiment datasets. For instance, neutral tweets often dominate volume, while negative tweets are more operationally important in customer service. Therefore, we paid close attention to class-specific precision and recall, especially for the negative class, to ensure that frustrated customer messages were identified without over-triggering on neutral ones (R. Barik and Misra 2024; Gandy and Smith 2025).
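Class-specific precision, recall, and F1 can be computed one-vs-rest from true and predicted labels, as in the following sketch. The label lists are hypothetical and are not the project's data; the `per_class_metrics` helper is an illustrative name.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed one-vs-rest."""
    results = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


# Hypothetical labels, for illustration only.
y_true = ["neg", "neg", "neg", "neu", "neu", "pos", "pos", "pos"]
y_pred = ["neg", "neg", "neu", "neu", "neu", "pos", "neu", "pos"]

metrics = per_class_metrics(y_true, y_pred, ["neg", "neu", "pos"])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, metrics)
```

In this toy case the neutral class absorbs the misclassified negative and positive tweets, which gives it perfect recall but lower precision, the same kind of pattern that class-specific metrics are meant to expose.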
Note: Some parts of this project were assisted by ChatGPT for writing support and citation formatting. All content was reviewed and edited by the authors to ensure accuracy and originality.
Key dataset fields: text, created_at, and inbound (customer or company).
A surprising number of tweets were positive, due to resolution acknowledgments and polite brand replies.
Figure: Word clouds by sentiment class
TF-IDF highlights important and rare terms across tweets
| Metric | Value |
|---|---|
| Accuracy | 77.1% |
| Precision | 80.96% |
| Recall | 77.1% |
| F1 Score | 77.45% |
Balanced performance, with emphasis on improving recall for negative sentiment
| Sentiment | Precision | Recall | F1 Score |
|---|---|---|---|
| Negative | 0.74 | 0.68 | 0.71 |
| Neutral | 0.62 | 0.95 | 0.75 |
| Positive | 0.93 | 0.73 | 0.82 |
High recall for Neutral (templated replies)
High precision for Positive
Improved recall for Negative from 64% → 68%
A key goal: identifying dissatisfaction more reliably.
Recall on Negative tweets improved from 64% → 68% using sample weighting
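The text does not specify which sample weighting scheme was used. One common option, shown here purely as an assumption, is inverse class frequency weighting, where each sample is weighted by n_samples / (n_classes × class_count). The helper name and label distribution below are hypothetical; such weights could then be passed to a classifier's `fit` method through its `sample_weight` argument.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Hypothetical, imbalanced label distribution (not the project's data).
labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
weights = balanced_sample_weights(labels)
print(sorted(set(round(w, 3) for w in weights)))
```

Under this scheme the rarest class receives the largest per-sample weight, so errors on negative tweets cost more during training.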
Business Impact
Limitations:
Future Work:
This project successfully demonstrated a scalable approach to sentiment classification in customer support conversations by combining VADER, TF-IDF, and XGBoost. VADER provided fast and interpretable sentiment labels tailored for informal social media language, which we used to train a high-performing supervised classifier.
Achieved:
77% accuracy
0.71 F1 for Negative tone
Fast, scalable tone detection
By automating tone detection in real-time support channels, this framework offers immediate business value. It can help teams prioritize escalations, identify service bottlenecks, and monitor agent interactions at scale. Our findings confirm that interpretable, rule-based sentiment scoring (via VADER) can be successfully integrated with machine learning to support responsive, tone-aware customer engagement.
## References